Abstract

This paper presents a method for bootstrapping a fine-grained,
broad-coverage part-of-speech (POS) tagger in a new language using
only one person-day of data acquisition effort. It requires only three
resources, which are currently readily available in 60-100 world
languages: (1) an online or hard-copy pocket-sized bilingual
dictionary, (2) a basic library reference grammar, and (3) access to
an existing monolingual text corpus in the language. The algorithm
begins by inducing initial lexical POS distributions from English
translations in a bilingual dictionary without POS tags. It handles
irregular, regular and semi-regular morphology through a robust
generative model using weighted Levenshtein alignments. Unsupervised
induction of grammatical gender is performed via global modeling of
context-window feature agreement. Using a combination of these and
other evidence sources, interactive training of context and lexical
prior models is accomplished for fine-grained POS tag spaces.
Experiments show high accuracy and fine-grained tag resolution with
minimal new human effort.

1  Introduction 

Previous work in minimally supervised language learning has defined
"minimal" using several different criteria. Some have assumed only
partially tagged training corpora (Merialdo, 1994), while others have
begun with small tagged seed wordlists (such as Collins and Singer
(1999) and Cucerzan and Yarowsky (1999) for named-entity tagging).
Others have exploited the automatic transfer of some already existing
annotated resource in a different medium or language (such as the
translingual projection of part-of-speech tags, syntactic bracketing
and inflectional morphology in Yarowsky et al. (2001), requiring no
direct supervision in the foreign language). Ngai and Yarowsky (2000)
observed that an often more practical measure of the degree of
supervision is not simply the quantity of annotated words, but the
total weighted human labor and resource costs of different modes of
supervision (allowing manual rule writing to be compared directly with
active learning on a common cost-performance learning curve).

In this paper we observe that another useful measure of (minimal)
supervision is the additional cost of obtaining a desired
functionality from existing commonly available knowledge sources. In
particular, we note that for a remarkably wide range of languages,
academic libraries, many booksellers and websites offer a foundation
of linguistic wisdom in reference grammars and dictionaries. Thus
starting from this baseline, what is the marginal cost of distilling
from and augmenting this existing knowledge to achieve a desired new
task functionality?

2  Inducing  POS  Tag  Candidates  from 
Unlabeled  Bilingual  Dictionaries 

A substantial percentage of foreign language dictionaries that are
available online or in smaller paperback format are simple bilingual
word or phrase translation lists which fail to specify part of
speech.¹

Thus one component question of this work is how one can extract
preliminary part-of-speech distributions from untagged monolingual
translation lists. Figure 1 illustrates such a bilingual dictionary,
also specifying the true part of speech for each possible translation,
which we do not assume to be generally available.

One approach is to take an unweighted mixture of the prior
part-of-speech distributions for the English words $e_j$ given in the
translation list (TE), as illustrated in Figure 2. These probabilities
may be estimated from a large and preferably balanced corpus. In this
work, we used statistics from the Brown and WSJ corpora combined.
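The unweighted mixture described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prior table `ENGLISH_POS_PRIOR` and its numbers are invented for the example (in the paper these priors come from Brown and WSJ counts), and the function name `induce_pos_distribution` is hypothetical.

```python
from collections import defaultdict

# Hypothetical English priors P(tag | English word). The values are
# invented for illustration; the paper estimates them from Brown + WSJ.
ENGLISH_POS_PRIOR = {
    "warrant": {"N": 0.6, "V": 0.4},
    "proxy": {"N": 1.0},
    "mandate": {"N": 0.8, "V": 0.2},
}


def induce_pos_distribution(translations, prior=ENGLISH_POS_PRIOR):
    """Average the POS priors of the single-word English translations.

    Translations with no known prior are skipped; an empty dict means
    that no translation was covered.
    """
    known = [prior[w] for w in translations if w in prior]
    if not known:
        return {}
    mixture = defaultdict(float)
    for dist in known:
        for tag, p in dist.items():
            mixture[tag] += p / len(known)  # equal weight per translation
    return dict(mixture)


# Single-word translations of Romanian "mandat" from Figure 1.
mandat = induce_pos_distribution(["warrant", "proxy", "mandate"])
```

With these toy priors the noun reading receives most of the probability mass for mandat, while a small verb probability survives as a candidate for later training.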

¹ In this section, we will use the term POS tag to denote only the
main part-of-speech tags (noun, verb, adjective, adverb, preposition,
etc.) and not the fine-grained tags (such as
Noun-Genitive-fem-plur-def).
Romanian    True POS   English translation list
mandat      N          warrant; proxy; mandate; money order; power of attorney
manechin    N          model, dummy
manifesta   V          arise, express itself, show
manual      Adj        manual;
            N          manual; textbook; handbook
mare        Adj        large; big; great; tall; old; important;
            N          sea
maro        Adj        brown, chestnut

Figure 1: A sample Romanian-English dictionary. The POS tags are used
only for evaluation and are not available in many bilingual
dictionaries.

Figure 2: Inducing a preliminary POS distribution for the Romanian
word mandat via a simple English translation list.

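However, when a translation candidate is phrasal (e.g. mandat ↔ money
order), one can model the more general probability of the foreign
word's part-of-speech tag ($T_f$) given the part-of-speech sequence of
the English phrasal translation ($T_{e_1} \ldots T_{e_n}$). For
example, one could model $P(T_f \mid \textit{money order})$ via
$P(T_f \mid N_e N_e)$ and $P(T_f \mid \textit{manifest itself})$ via
$P(T_f \mid V_e Pro_e)$. However, because English words often have
multiple parts of speech (e.g. order may be a verb), one may weight
phrasal POS sequence probabilities (making an independence assumption)
as:

$$
\begin{aligned}
P(N_f \mid \textit{money order}) ={} & P(N_f \mid N_e N_e) \cdot P(N_e \mid \textit{money}) \cdot P(N_e \mid \textit{order}) \\
{}+{} & P(N_f \mid N_e V_e) \cdot P(N_e \mid \textit{money}) \cdot P(V_e \mid \textit{order}) \\
{}+{} & P(N_f \mid V_e N_e) \cdot P(V_e \mid \textit{money}) \cdot P(N_e \mid \textit{order}) \\
{}+{} & P(N_f \mid V_e V_e) \cdot P(V_e \mid \textit{money}) \cdot P(V_e \mid \textit{order})
\end{aligned}
$$

And in general:

$$
P(T_f \mid w_{e_1}, w_{e_2}) = \sum_{T_{e_1}, T_{e_2}} P(T_f \mid T_{e_1} T_{e_2}) \cdot P(T_{e_1} \mid w_{e_1}) \cdot P(T_{e_2} \mid w_{e_2})
$$

where $P(T_{e_i} \mid w_{e_i})$ is estimated from the dictionary as
above. Without an independence assumption:

$$
P(T_f \mid w_{e_1} \ldots w_{e_n}) = P(T_f \mid T_{e_1} \ldots T_{e_n}) \cdot P(T_{e_1} \ldots T_{e_n} \mid w_{e_1} \ldots w_{e_n})
$$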
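The independence-assumption weighting can be illustrated with a short sketch. All probabilities below are invented toy values, and the names `P_TAG_GIVEN_WORD`, `P_TF_GIVEN_SEQ` and `phrasal_pos` are hypothetical; the code simply enumerates every English tag sequence for the phrase and sums the weighted contributions, as in the expansion for money order.

```python
from itertools import product

# Toy values, assumed for illustration only.
# P(T_e | w_e): English tag priors for each word of the phrase.
P_TAG_GIVEN_WORD = {
    "money": {"N": 0.95, "V": 0.05},
    "order": {"N": 0.70, "V": 0.30},
}
# P(T_f | T_e1 T_e2): foreign tag given the English tag sequence.
P_TF_GIVEN_SEQ = {
    ("N", "N"): {"N": 0.9, "V": 0.1},
    ("N", "V"): {"N": 0.5, "V": 0.5},
    ("V", "N"): {"N": 0.2, "V": 0.8},
    ("V", "V"): {"N": 0.1, "V": 0.9},
}


def phrasal_pos(words, tag_given_word, tf_given_seq):
    """Sum P(T_f | tag sequence) * prod_i P(T_ei | w_ei) over all sequences."""
    result = {}
    options = [tag_given_word[w].items() for w in words]
    for choice in product(*options):
        tags = tuple(tag for tag, _ in choice)
        weight = 1.0
        for _, p in choice:
            weight *= p  # independence assumption across phrase words
        for tf, p_tf in tf_given_seq.get(tags, {}).items():
            result[tf] = result.get(tf, 0.0) + p_tf * weight
    return result


money_order = phrasal_pos(["money", "order"], P_TAG_GIVEN_WORD, P_TF_GIVEN_SEQ)
```

Because every English tag sequence is enumerated, the cost grows exponentially with phrase length; for the short phrasal translations typical of pocket dictionaries this is unlikely to matter.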

There are two major options via which one can estimate
$P(T_f \mid T_{e_1} \ldots T_{e_n})$. The first is to assume that the
part-of-speech usage of phrasal (English) translations is generally
consistent across dictionaries (e.g. $P(N_f \mid N_{e_1} N_{e_2})$
remains high regardless of publisher or language). Hence one could use
any foreign-English bilingual dictionary that also includes the true
foreign word part of speech in addition to its translations to train
these probabilities. Alternately, one could do a first-pass assignment
of foreign-word part of speech based on only single-word translations
as in Figure 2, and use this to train
$P(T_f \mid T_{e_1} \ldots T_{e_n})$ for those foreign words having
both phrasal and single-word definitions (such as mandat). The
advantage of this approach is that it may benefit dictionaries with
different phrasal translation styles from the training dictionary
(e.g. use or omission of the word 'to' in verb definitions). However,
given the assumption of relatively consistent dictionary formatting
styles (which was unfortunately not the case for Kurdish), we
evaluated this work based on supervised phrasal training from a single
independent third-language dictionary.

Table 1 measures the POS induction performance on three languages,
where the true POS tags were given in the dictionary (as in Figure 1),
but ignored except for evaluation. The accuracy values in this table
are based on exact matches between a word's dictionary-provided POS
and the most probable tag in its induced distribution.

For our target application of part-of-speech tagging, what matters is
to have a robust tag probability distribution that includes the true
candidate with sufficiently large probability to seed further
training. By setting this baseline threshold to 0.1 and deleting lower
ranked candidates, up to 98% of the true POS were found to be above
this threshold and hence were considered in future training.

The Mean Probability of Truth, as shown in Table 1, is another measure
of the quality of the POS predictions made by the algorithm,
representing the probability mass associated with the true POS tag
averaged over all words.
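The evaluation measures discussed above can be computed along the following lines. This is a sketch under stated assumptions: the toy distributions are invented, the function names `prune` and `evaluate` are hypothetical, and the choice to compute accuracy figures only over words that received a prediction is an assumption, since the text does not state the denominator.

```python
def prune(dist, threshold=0.1):
    """Drop candidate tags whose probability falls below the threshold."""
    return {tag: p for tag, p in dist.items() if p >= threshold}


def evaluate(induced, truth, threshold=0.1):
    """Type-level metrics: every dictionary entry counts equally."""
    covered = {w: d for w, d in induced.items() if d}  # words with a prediction
    n = len(covered)
    if n == 0:
        return {"coverage": 0.0}
    exact = sum(max(d, key=d.get) == truth[w] for w, d in covered.items())
    over = sum(truth[w] in prune(d, threshold) for w, d in covered.items())
    mass = sum(d.get(truth[w], 0.0) for w, d in covered.items())
    return {
        "coverage": n / len(induced),
        "exact": exact / n,
        "over_threshold": over / n,
        "mean_prob_truth": mass / n,
    }


# Invented induced distributions and gold tags for four entries.
induced = {
    "mandat": {"N": 0.8, "V": 0.2},
    "mare": {"Adj": 0.55, "N": 0.45},
    "maro": {"N": 0.95, "Adj": 0.05},
    "xyz": {},  # no English translation had a known POS prior
}
truth = {"mandat": "N", "mare": "N", "maro": "Adj", "xyz": "N"}
scores = evaluate(induced, truth)
```

In this toy case the exact-match score is low, yet two of the three covered words keep their true tag above the 0.1 threshold, which is the property that matters for seeding later training.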

In some cases the algorithm could not predict a POS tag, primarily due
to English translations for which no POS distribution was known (often
an obscure word, proper name or OCR error). This occasional omission
is measured by the coverage column.

Target     Training            Accuracy    Correct POS      Coverage  Mean Probability
Language   Dictionary          Exact POS   Over Threshold   (%)       of Truth
                               (%)         (%)
Romanian   Spanish-English     92.9        97.8             98        .91
Kurdish    Spanish-English     76.8        93.1             95        .82
Spanish    Romanian-English    83.3        94.9             97        .86

Table 1: Performance of inducing candidate part-of-speech
distributions derived solely from untagged English translation lists.
Results are measured by type (all dictionary entries are weighted
equally).

Most of the observed errors are due to differences in phrasal
definitional conventions in the training and testing dictionaries,
long phrasal idioms, single-word definitions with ambiguous English
parts of speech and OCR errors. The Kurdish dictionary was
particularly hindered by frequent long phrasal translations which
often included an explanation or definition in their translation.
Because all dictionary entries are equally weighted, errors on rare
words such as mythological characters or kinship terms can
substantially downgrade performance. But for the purposes of providing
seed POS distributions to context-sensitive taggers, performance is
quite adequate for this follow-on task.

3  Inducing  Morphological  Analyses 

There has been extensive previous work in the supervised and minimally
supervised induction of both affix paradigms (e.g. Goldsmith, 2000;
Snover and Brent, 2001) and diverse models of regular and irregular
concatenative and non-concatenative morphology (e.g. Schone and
Jurafsky, 2000; van den Bosch and Daelemans, 1999; Yarowsky and
Wicentowski, 2000). While such approaches are important from the
perspective of learning theory or broad-coverage handling of irregular
forms, another possible paradigm for minimal supervision is to begin
with whatever knowledge can be efficiently manually entered from the
grammar book in several hours' work.

We defined such grammar-based "supervision" as entry of regular
inflectional affix changes and their associated part of speech in
standardized ordering of fine-grained attributes, as in Table 2 for
Spanish and Romanian. The full tables have approximately 200 lines
each and required roughly 1.5-2 person-hours for entry.

Given a dictionary marked with core parts of speech, it is trivial to
generate hypothesized inflected forms following the regular paradigms,
as shown in the left side of Figure 3. However, due to irregularities
and semi-regularities such as stem-changes, such generation will
clearly have substantial inaccuracies and overgenerations.

Root Affix   Inflected Affix   Part-of-speech Tag

Spanish:
o$           o$                Adj-masc-sing
o$           os$               Adj-masc-plur
o$           a$                Adj-fem-sing
o$           as$               Adj-fem-plur
e$           e$                Adj-masc,fem-sing
e$           es$               Adj-masc,fem-plur
ar$          o$                Verb-Indic_Pres-p1-sing
ar$          as$               Verb-Indic_Pres-p2-sing
ar$          a$                Verb-Indic_Pres-p3-sing
ar$          amos$             Verb-Indic_Pres-p1-plur
ar$          ais$              Verb-Indic_Pres-p2-plur
ar$          an$               Verb-Indic_Pres-p3-plur

Romanian:
a$           e$                Noun-Nomin-p3-fem-plur-indef
e$           i$                Noun-Nomin-p3-fem-plur-indef
ea$          ele$              Noun-Nomin-p3-fem-plur-indef
i$           ile$              Noun-Nomin-p3-fem-plur-indef
a$           ale$              Noun-Nomin-p3-fem-plur-indef
$            $                 Adj-masc,neut-sing
$            a$                Adj-fem-sing
$            i$                Adj-masc,neut,fem-plur
$            e$                Adj-fem,neut-plur
ru$          ra$               Adj-fem-sing
ru$          ri$               Adj-masc,neut,fem-plur
ru$          re$               Adj-fem-plur
e$           $                 Verb-Indic_Pres-p1-sing
e$           i$                Verb-Indic_Pres-p2-sing
e$           e$                Verb-Indic_Pres-p3-sing
e$           em$               Verb-Indic_Pres-p1-plur
e$           eti$              Verb-Indic_Pres-p2-plur
e$           $                 Verb-Indic_Pres-p3-plur

Table 2: Sample extracted regular inflectional paradigms (suffix
context is marked by $).
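Generating hypothesized inflections from such a paradigm table might look like the sketch below. The rows are the Spanish entries shown in Table 2; the tuple layout and the function name `generate` are assumptions made for illustration, not the format used by the authors.

```python
# (root suffix, inflected suffix, fine-grained tag), taken from the
# Spanish rows of Table 2 with the "$" end-of-word marker dropped.
SPANISH_PARADIGMS = [
    ("o", "o", "Adj-masc-sing"),
    ("o", "os", "Adj-masc-plur"),
    ("o", "a", "Adj-fem-sing"),
    ("o", "as", "Adj-fem-plur"),
    ("e", "e", "Adj-masc,fem-sing"),
    ("e", "es", "Adj-masc,fem-plur"),
    ("ar", "o", "Verb-Indic_Pres-p1-sing"),
    ("ar", "as", "Verb-Indic_Pres-p2-sing"),
    ("ar", "a", "Verb-Indic_Pres-p3-sing"),
    ("ar", "amos", "Verb-Indic_Pres-p1-plur"),
    ("ar", "ais", "Verb-Indic_Pres-p2-plur"),
    ("ar", "an", "Verb-Indic_Pres-p3-plur"),
]


def generate(root, core_pos, paradigms):
    """Apply every regular suffix rule whose core POS and root suffix match."""
    forms = []
    for root_sfx, infl_sfx, tag in paradigms:
        if tag.startswith(core_pos) and root.endswith(root_sfx):
            stem = root[: len(root) - len(root_sfx)]  # safe for empty suffix
            forms.append((stem + infl_sfx, tag))
    return forms


verb_forms = generate("destrozar", "Verb", SPANISH_PARADIGMS)
adj_forms = generate("blanco", "Adj", SPANISH_PARADIGMS)
```

The generator applies rules blindly, so a stem-changing verb would receive regular but non-existent forms; this is the overgeneration the alignment step is meant to absorb.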

However, through weighted-Levenshtein-based iterative alignment
models, such as described in Yarowsky and Wicentowski (2000), one can
perform a probabilistic string match from all lexical tokens actually
observed in a monolingual corpus, as in the right side of Figure 3.²

[Figure 3 diagram, three columns: Dictionary Rootword; Regular
Inflection Generation; Observed Corpus Words]

Figure 3: Inflectional analysis induction via weighted string
alignment to noisy generations from dictionary roots under regular
paradigms

For example, when looking for a potential analysis path for the
Spanish irregular inflection destrocen, the closest string match is
the regular hypothesis destrozar/V → destrozen/V-pres_subj-3pl.
Likewise, the closest string match for destruyen is
destruir/V → destruen/V-pres_indic-3pl. The differences between these
regular hypotheses and observed inflected forms are the relatively
productive stem changes ∅→y and z→c, neither of which was listed in
the inflectional supervision table, and yet they were correctly
handled. Note that a traditional P(POS | suffix) model would fail to
handle this case given that the common inflection suffix -en
corresponds to two different parts of speech here (present indicative
or subjunctive depending on -ir or -ar paradigm).
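The closest-match search can be sketched with a plain weighted edit distance. The weights here are hand-picked assumptions (vowel-for-vowel substitutions are cheaper than other edits), whereas the cited alignment models learn their weights iteratively; the names `sub_cost`, `weighted_levenshtein` and `best_analysis` are hypothetical.

```python
VOWELS = set("aeiou")


def sub_cost(a, b):
    """Assumed substitution weights: vowel-for-vowel changes are cheaper."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


def weighted_levenshtein(src, dst, indel=1.0):
    """Dynamic-programming edit distance with weighted substitutions."""
    prev = [j * indel for j in range(len(dst) + 1)]
    for i, a in enumerate(src, 1):
        cur = [i * indel]
        for j, b in enumerate(dst, 1):
            cur.append(min(prev[j] + indel,            # delete a
                           cur[j - 1] + indel,         # insert b
                           prev[j - 1] + sub_cost(a, b)))
        prev = cur
    return prev[-1]


def best_analysis(observed, hypotheses):
    """Return (root, tag, cost) of the generated form closest to `observed`."""
    form, root, tag = min(hypotheses,
                          key=lambda h: weighted_levenshtein(h[0], observed))
    return root, tag, weighted_levenshtein(form, observed)


# (generated form, dictionary root, tag) under regular paradigms.
HYPOTHESES = [
    ("destrozen", "destrozar", "V-pres_subj-3pl"),
    ("destrozan", "destrozar", "V-pres_indic-3pl"),
    ("destruen", "destruir", "V-pres_indic-3pl"),
    ("destruan", "destruir", "V-pres_subj-3pl"),
]
root1, tag1, cost1 = best_analysis("destrocen", HYPOTHESES)
root2, tag2, cost2 = best_analysis("destruyen", HYPOTHESES)
```

Even with these crude weights, the observed form destrocen maps to the subjunctive hypothesis of destrozar and destruyen to the indicative hypothesis of destruir, which a suffix-only model keyed on -en could not distinguish.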

Also note that the irregular stem change processes such as
dormir→duermen have a correct best-fit analysis, despite the absence
of any internal stem change exemplars (e.g. o→ue) in the
human-generated inflectional supervision table.

For further robustness, the consensus model of $P(POS_i \mid FW)$ is
estimated as a weighted mixture of the part-of-speech tags of the most
closely aligned pseudo-regular generated inflections.

The inflections of closed-class words (such as pronouns, determiners
and auxiliary verbs) are not well handled by this generative-alignment
model, due both to their often very high irregularity (e.g. the
Spanish verb ser (to be)) and to their typical shortness (e.g. the
pronominal inflections of mi, tu, su). Thus as one final amount of
supervision, lists of closed-class words, paired with their
inflections and fine-grained part-of-speech tags, were entered
manually from the grammar book (e.g. aquellas# (aquel)
Adj_Dem-fem-plur-p3). This final source of supervision utilized an
average of 400 lines and 3 person-hours per language.

4  POS  Model  Induction 

The  non-traditional  supervision  methodology  in 
Sections  2  and  3  yields  a  noisy  but  broad-coverage 
candidate  space  of  parts  of  speech  with  little  human 
effort. 

We then perform a noise-robust combination of model estimation and
re-estimation techniques for the syntagmatic trigram models
$P(pos_2 \mid pos_1, pos_0)$ and lexical priors $P(w_i \mid pos_j)$
using the word co-occurrence information from a raw corpus.

• A suffix-based part-of-speech probability model
$P(pos_j \mid suffix(w_i))$ using hierarchically smoothed tries is
trained on the raw initial tag distributions, yielding coverage to
unseen words and smoothing of low-confidence initial tag assignments.

• Paradigmatic cross-context tag modeling is performed as in Cucerzan
and Yarowsky (2000) when sufficiently large unannotated corpora are
available.

• Sub-part-of-speech contextual agreement for features such as gender
is performed as described in Section 4.1.

• The part-of-speech tag sequence models $P(pos_2 \mid pos_1, pos_0)$
utilize a weighted backoff between fine-grained and coarse-grained
tags.

• Both the tag-sequence and lexical prior models are iteratively
retrained using these additional evidence sources and first-pass
probability distributions.
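As an illustration of the first item in the list above, the sketch below shows one way a suffix-based tag model with hierarchical smoothing could be built. It is not the authors' model: the interpolation scheme, the parameter `lam`, the maximum suffix length and the class name `SuffixTagModel` are all assumptions, and suffix nodes are stored in a flat dictionary keyed by suffix string rather than in an explicit trie.

```python
from collections import defaultdict


class SuffixTagModel:
    """P(tag | suffix), each suffix smoothed toward its shorter parent."""

    def __init__(self, max_len=4, lam=1.0):
        self.max_len = max_len
        self.lam = lam  # assumed weight given to the parent estimate
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, soft_tagged):
        """soft_tagged: iterable of (word, {tag: probability}) pairs."""
        for word, dist in soft_tagged:
            for k in range(min(self.max_len, len(word)) + 1):
                suffix = word[len(word) - k:]  # k = 0 gives the root ""
                for tag, p in dist.items():
                    self.counts[suffix][tag] += p

    def predict(self, word):
        root = self.counts[""]
        total = sum(root.values())
        if total == 0:
            return {}
        probs = {tag: c / total for tag, c in root.items()}
        for k in range(1, min(self.max_len, len(word)) + 1):
            suffix = word[len(word) - k:]
            if suffix not in self.counts:
                break  # unseen suffix: keep the shorter-suffix estimate
            node = self.counts[suffix]
            n = sum(node.values())
            probs = {tag: (node.get(tag, 0.0) + self.lam * p) / (n + self.lam)
                     for tag, p in probs.items()}
        return probs


model = SuffixTagModel()
model.train([("destrozan", {"V": 1.0}), ("cantan", {"V": 1.0}),
             ("casa", {"N": 1.0}), ("mesa", {"N": 1.0})])
hablan = model.predict("hablan")  # unseen word ending in -an
ropa = model.predict("ropa")      # unseen word ending in -a
```

Each longer suffix refines the estimate inherited from the shorter one, so unseen words fall back gracefully to whatever suffix evidence exists.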


² For processing efficiency, one additional constraint is that
potential hypothesized↔observed string pair candidates must exactly
match in both initial consonant cluster and suffix of the generated
hypothesis.


The success of this model is based on the assumption that (a) words of
the same part of speech tend to have similar tag sequence behavior,
and (b) there are sufficient instances of each POS tag labeled by
either the morphology models or closed-class entries described in
Section 3. One example where these assumptions do not hold is for the
Romanian word a, which has 5 possible POS tags, including
Infinitive_Marker (corresponding to the English word to). But because
the Infinitive_Marker tag has no other word instances in Romanian, no
other filial supervision exists to resolve the ambiguity of a if no
context-sensitive tagging is provided (such as the preference for a to
be labeled Infinitive_Marker when followed by a Verb-Infinitive). Thus
one avenue of potential improvement to these models would be to
include limited tagged contexts for ambiguous small-class (or
singleton-class) words, although such supervision is less readily
extractable from grammar books by non-native speakers, and was not
employed here.




4.1  Contextual-agreement  models  for 
part-of-speech  subtags 

Traditional part-of-speech models assume a strict Markovian sequential
dependency. However, Adj-Noun, Det-Noun and Noun-Verb agreement at the
subtag level (e.g. for person, number, case and gender) often does not
require direct adjacency, and is based on the selective matching of
isolated subfeatures. This is particularly important for grammatical
gender, where the lack of gender features projected from English
rootwords in a bilingual dictionary (as in Section 2) requires
contextual agreement to assign gender to many inflected and root
forms.

However, given the assumptions of minimal supervision, it is not
reasonable to require a parser or dependency model to identify
non-adjacent agreeing pairs explicitly. Rather, we utilize a much more
general tendency for words exhibiting a property such as grammatical
gender to co-occur in a relatively narrow window with other words of
the same gender (etc.) with a probability greater than chance.
Empirically, we observe this in Figures 4-5, which show the
gender-agreement ratio between a target noun/adjective and other
gender-marked words appearing in context at relative position ±i.
Adjectives in Romanian exhibit a stronger agreement tendency with
words to their left (5/1 ratio), while for nouns the agreement ratio
is quite closely balanced between -1 (primarily determiners) and +1
(primarily adjectives), although weaker (2.4/1 ratio), perhaps due to
a greater relative tendency for nouns to juxtapose directly with other
independent clauses of different gender. Also, both parts of speech
converge on the agreement ratio expected by chance (0.82) relatively
quickly. Thus while any individual context may suggest incorrect
gender based on agreement, if one aggregates over all occurrences of a
word in a corpus, a consensus gender preference emerges, with the true
gender agreement signal exceeding nearby spurious gender noise.

Figure 4: Ratio of the frequency that a gender-marked adjective
(above) or noun (below) agrees in gender with another
noun/adjective/determiner at relative position i over the frequency of
gender disagreement at that relative position.

Figure 5: The probability that at least one gender-marked word will
occur within a window of ±i words relative to another gender-marked
word (of any part of speech).

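Formally, we can model this window-weighted global feature consensus
as:

$$
P(Gen_k \mid w) = \frac{1}{Z} \sum_{i \in loc(w)} \sum_{j=-3}^{3} P(Gen_k \mid w_{i+j}) \, Wt(j)
$$

where $loc(w)$ ranges over the corpus positions of $w$, $Wt(j)$ is the
weight assigned to window offset $j$, and $Z$ is a normalizing
constant.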


The ±3 window-size parameter was selected prior to the studies shown
in Figures 4-5, but is supported by them. Beyond this window the
agreement/disagreement ratio approaches chance, but with a smaller
window the probability of finding any gender-marked word in the window
drops below the 80% coverage observed for ±3, trading lower coverage
for increased accuracy.
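A toy version of this window-weighted consensus is sketched below. The offset weights, the tiny Spanish example and the seed gender lexicon are invented for illustration; skipping offset 0 and normalizing by the total accumulated mass are assumptions, and `gender_consensus` is a hypothetical name.

```python
# Assumed offset weights for the +/-3 window; not the paper's values.
WINDOW_WEIGHTS = {-3: 0.25, -2: 0.5, -1: 1.0, 1: 1.0, 2: 0.5, 3: 0.25}


def gender_consensus(corpus, target, known_gender, weights=WINDOW_WEIGHTS,
                     genders=("masc", "fem")):
    """Aggregate gender evidence from the windows around every occurrence."""
    scores = {g: 0.0 for g in genders}
    for i, token in enumerate(corpus):
        if token != target:
            continue
        for j, wt in weights.items():
            k = i + j
            if not 0 <= k < len(corpus):
                continue  # window runs past the corpus edge
            for g, p in known_gender.get(corpus[k], {}).items():
                scores[g] += wt * p
    z = sum(scores.values())
    return {g: s / z for g, s in scores.items()} if z else {}


corpus = "la parte alta de el muro blanco y la parte baja".split()
# Seed genders, e.g. from closed-class entries and adjective paradigms.
known_gender = {
    "la": {"fem": 1.0}, "alta": {"fem": 1.0}, "baja": {"fem": 1.0},
    "el": {"masc": 1.0}, "blanco": {"masc": 1.0},
}
parte = gender_consensus(corpus, "parte", known_gender)
```

Individual windows contain spurious masculine neighbours (el, blanco), but once both occurrences of parte are pooled the feminine evidence dominates, mirroring the aggregation argument in the text.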

Under the assumption that the overwhelming majority of nouns have a
single grammatical gender independent of context, we perform smoothing
to force nouns with sufficient global context frequency towards their
single most likely gender.

Finally, the trie-based suffix model noted in Section 3 can be
utilized here to further generalize gender affixal tendencies for use
in smoothing poorly represented single words. Through this approach we
successfully discover a wide space of low-entropy gender affix
tendencies, including the common -a, -dad and -ción feminine affixes
in Spanish, without any human or dictionary supervision of nominal
gender. But even those words without gender-distinguishing affixes
(e.g. parte, cabal) can be successfully learned via global context
maximization.

5  Evaluation  of  the  Full  Part-of-speech 
Tagger 

One problem with minimally supervised learning of foreign languages is
that annotated evaluation data are often not available for the
features being induced, or are otherwise difficult to obtain. Thus we
have used for initial test languages two languages familiar to the
authors (Romanian and Spanish) for which sufficient evaluation
resources could be obtained. However, the monolingual corpora utilized
for bootstrapping were quite small (123 thousand words of the book
1984 for Romanian and 3.2 million words of newswire for Spanish),
which are easily comparable to the sizes that can be accessed online
for 60-100 world languages. The seed dictionaries were located online
(for Spanish, 42k entries) and via OCR (for Romanian, 7k entries), and
small grammar references were obtained at a local bookstore. 1000
words of test data were annotated with a standardized, finely detailed
part-of-speech tag inventory including the full complex distinctions
for gender, person, number, case, detailed tense and nominal
definiteness (inventories of 259 and 230 fine-grained tags were used
for Spanish and Romanian respectively).

The minimal supervision in this study consisted of an average total of
4 person-hours per language for manually entering the inflectional
paradigms and associated parts of speech from a grammar as in Section
3, and an additional average of 3 person-hours per language for
dictionary extraction and entry parsing (OCR itself on our high-speed
2-sided scanner with OmniPage Pro took under 30 minutes). As would be
expected given that data entry was done by computer scientists who
were not native speakers of the test languages, significant analysis
errors or gaps were introduced when rather blindly transferring from
the reference grammar. Thus to test the relative contributions of
limited native-speaker help when available, for roughly 4 additional
total person-hours in a second test condition for Romanian a native
speaker corrected and augmented gaps in the patterns previously
entered from the grammar book, focusing almost exclusively on the
complex inflections of closed-class words.

A summary of the results for these three supervision modes is given in
Table 3. Performance is broken down by fine-grained part of speech.
Exact-match accuracy is measured over both the full fine-grained (up
to 5-feature) part-of-speech space, as well as the 12-class core POS
tag (noun and proper noun, pronoun, verb, adjective, adverb, numeral,
determiner, conjunction, preposition, interjection, particle,
punctuation). The feature of grammatical gender was specifically
isolated because it is rarely salient for cross-language applications
such as machine translation (where grammatical gender rarely
transfers), and because its induction algorithm in Section 4.1 depends
heavily on the size of the monolingual corpus (which is small in these
experiments, suggesting size-dependent potential for significant
further improvement here).

Finally, a post-hoc analysis of the system vs. test data discrepancies
showed that a significant number were simply arbitrary differences in
annotation convention between the grammar-book analyses and the test
data tagging policy. For example, one such "error"/discrepancy is the
rather arbitrary distinction of whether the Romanian word oricare
(meaning any) should be considered an adjective (as listed in a
standard bilingual dictionary) or a determiner. Another difference is
whether proper-name citations of common nouns (e.g. Casa Blanca)
should be annotated for gender/number etc. or not.

Yet regardless of exactly how many system-test discrepancies are just
policy differences rather than errors, even the raw accuracy here is
very promising given the very fine-grained part-of-speech inventory
and small monolingual data size used for bootstrapping. And ultimately
the performance is quite remarkable given that it is the result of
less than 1 total person-day of data collection and supervision, in
contrast to the thousands of hours and $100,000-$1,000,000 spent on
some annotated training data in much more limited tagset inventories.
Thus in terms of cost-benefit analysis, the supervision paradigm and
associated bootstrapping models presented here offer quite a good
value of new functionality per labor invested.

                      Spanish     Romanian    Romanian
                      NNS 8h      NNS 8h      NNS-8h NS-4h
All words
  core-tag            93.1        86.3        89.2
  exact-match         86.5        68.6        75.5
  exact w/o gender    87.0        76.7        83.0
Nouns
  core-tag            90.3        97.4        97.4
  *number             100.0       97.4        98.9
  *gender             100.0       54.9        64.7
  *definiteness       -           96.6        93.7
  *case               -           97.4        97.4
Verbs
  core-tag            94.7        87.9        89.5
  *tense              93.0        92.6        93.2
  *number             100.0       91.5        91.2
  *person             97.2        92.6        93.2
Adjectives
  core-tag            79.7        78.6        81.5
  *gender             100.0       81.3        82.2
  *number             100.0       98.3        98.3

Table 3: Performance (% accuracy) of POS tagger induction based on 1
person-day of supervision, no tagged training corpora and a
fine-grained (≈250 tags) tagset. NNS and NS refer to
non-native-speaker and native-speaker effort.

6  Conclusion 

This paper has presented an alternative to traditional corpus
annotation-based supervision of part-of-speech taggers. Given that
even obscure languages have reference grammars and dictionaries
available in large bookstores, libraries or even online, the focus of
this work is on using human supervision for efficient structured entry
of this seed knowledge (in the form of regular and semi-regular
inflectional paradigms and often irregular closed-class part-of-speech
entries). Minimally supervised bootstrapping procedures then used
corpus-derived distributional data to induce lexical tag probabilities
from dictionaries, irregular morphological analyses via weighted
Levenshtein-based alignment models, tag sequence probability induction
and grammatical gender agreement modeling. Experiments show
high-accuracy coarse and fine-grained (≈250 tag) part-of-speech
analyses using only one person-day of new human supervision based on
readily available linguistic resources.